Olive Garden’s Never-Ending Pasta Bowl is back, baby! Clarksville only has one location, and here it is:

[Map of the Clarksville Olive Garden location]
The Never-Ending Pasta Bowl is back at Olive Garden. Last week, several of us decided to go take advantage of this deal. While we were there, I collected some data. For each individual who participated, the data set includes name, lab, number of pasta bowls eaten, number of breadsticks eaten, type of appetizer ordered, number of drinks consumed, level of hunger upon entering Olive Garden, and whether or not leftovers were taken home. Hopefully I can run some models with this data. If not, I have some backup data to use that I found online. But pasta data seems more fun than health insurance cost data, so fingers crossed.
Multicollinearity means that once you know the value of one predictor, knowing a correlated predictor adds little new information. If a variable has a high variance inflation factor (common rules of thumb flag a VIF above roughly 5 to 10), you can drop it, as it is not adding much to the models. Let’s go ahead and check our data out:
I have a pretty small data frame, but it seems fine. Maybe even semi-parsimonious.
Let’s go ahead and build some candidate models for model selection. I’ll use number of pasta bowls eaten as my response variable, as predicted by various combinations of the other variables (number of breadsticks eaten, hunger, appetizer, lab, leftovers, etc.).
# Build multiple candidate models to compare. All of these models are built to explain some variation in the response variable using different predictor variables. Remember, all of these models need to have the same error structure.
mod1 <- lm(no.pasta ~ no.bread, data = pasta)
mod2 <- lm(no.pasta ~ no.bread + hunger, data = pasta)
mod3 <- lm(no.pasta ~ lab + app, data = pasta)
mod4 <- lm(no.pasta ~ lab + hunger + leftovers, data = pasta)
mod5 <- lm(no.pasta ~ hunger + app, data = pasta)
Now that we have five “candidate models” built, we can use the model.sel function from the MuMIn package to conduct some model selection. The output will rank them by AICc, the small-sample-corrected version of AIC, which is appropriate since I have a small sample size. We can look at the AICc, delta, and weight values to interpret the results.
# Compare the models, using model.sel function. Output will contain all model selection information in one object.
library(MuMIn)  # provides model.sel(), dredge(), and sw()
out.put <- model.sel(mod1, mod2, mod3, mod4, mod5)
out.put
## Model selection table
## (Int) no.brd hng app lab lft df logLik AICc delta weight
## mod1 2.887 -0.2529 3 -16.626 43.3 0.00 0.871
## mod2 1.526 -0.2565 + 4 -15.935 47.9 4.62 0.087
## mod5 0.500 + + 4 -16.650 49.3 6.05 0.042
## mod3 2.000 + + 6 -13.329 66.7 23.41 0.000
## mod4 -0.500 + + + 7 -0.326 70.7 27.40 0.000
## Models ranked by AICc(x)
Here, model 1, which simply used the number of breadsticks eaten to predict the number of bowls of pasta eaten, is ranked as the top model. It has an AICc value of 43.3, which is the lowest of the five and is more than two units below the next-best model, which tells us that the first- and second-ranked models are not comparable. Its weight value is a whopping 0.871, which means that if the best model for this data is among these five, there is an 87.1% chance that model 1 is it. The next two models in the ranking, models 2 and 5, are comparable, as their AICc values are within 2 units of each other. They share similar delta and weight values as well. Models 3 and 4 seem to suck.
But, our best model might not even be included in the five candidate models. Let’s see.
Building a global model, a model fit with all of the parameters at once, lets us explore every possible combination of variables. We can use the dredge function to fit and rank all of those combinations. The output is typically hefty, so beware.
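The global-model code isn’t shown above, so here is a minimal sketch of what it might look like. I’m assuming the data frame is `pasta`, the fitted object is named `all.params` (the name the collinearity check below uses), and the dredge results get saved as `out.put.all` (that last name is my guess). One quirk worth noting: dredge refuses to run unless the global model is fit with `na.action = na.fail`.

```r
library(MuMIn)

# Fit the global model with every predictor of interest.
# dredge() requires the model to fail loudly on missing data,
# hence na.action = na.fail.
all.params <- lm(no.pasta ~ no.bread + app + no.drink + hunger + leftovers,
                 data = pasta, na.action = na.fail)

# Fit and rank every subset of the global model's predictors by AICc
out.put.all <- dredge(all.params)
out.put.all
```

With five predictors, dredge fits all 2^5 = 32 subsets, which matches the 32 rows in the table below.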
## Global model call: lm(formula = no.pasta ~ no.bread + app + no.drink + hunger +
## leftovers, data = pasta)
## ---
## Model selection table
## (Int) app hng lft no.brd no.drn df logLik AICc delta weight
## 5 4.0000 + 3 -12.730 35.5 0.00 0.469
## 21 2.7500 + 0.8333 4 -10.020 36.0 0.58 0.351
## 7 3.0710 + + 4 -12.198 40.4 4.94 0.040
## 1 1.8500 2 -17.405 40.5 5.06 0.037
## 13 4.1540 + -0.06164 4 -12.636 41.3 5.81 0.026
## 6 4.1140 + + 4 -12.682 41.4 5.90 0.024
## 17 0.6000 0.8333 3 -16.508 43.0 7.56 0.011
## 9 2.8870 -0.25290 3 -16.626 43.3 7.79 0.010
## 3 0.5000 + 3 -16.843 43.7 8.22 0.008
## 2 1.6870 + 3 -17.120 44.2 8.78 0.006
## 23 2.3470 + + 0.7797 5 -9.795 44.6 9.13 0.005
## 22 2.8640 + + 0.8333 5 -9.938 44.9 9.42 0.004
## 29 2.5620 + 0.04688 0.8802 5 -9.939 44.9 9.42 0.004
## 11 1.5260 + -0.25650 4 -15.935 47.9 12.41 0.001
## 25 1.6350 -0.18490 0.6484 4 -16.077 48.2 12.69 0.001
## 18 0.4375 + 0.8333 4 -16.164 48.3 12.87 0.001
## 10 2.7700 + -0.27060 4 -16.169 48.3 12.88 0.001
## 19 -0.2105 + 0.7105 4 -16.173 48.3 12.89 0.001
## 15 3.2160 + + -0.07495 5 -12.044 49.1 13.63 0.001
## 8 3.1840 + + + 5 -12.096 49.2 13.73 0.000
## 4 0.5000 + + 4 -16.650 49.3 13.84 0.000
## 14 4.2080 + + -0.05345 5 -12.617 50.2 14.77 0.000
## 12 1.5820 + + -0.27060 5 -15.595 56.2 20.73 0.000
## 26 1.5590 + -0.20390 0.6294 5 -15.601 56.2 20.74 0.000
## 27 0.8278 + -0.20440 0.4900 5 -15.608 56.2 20.76 0.000
## 20 -0.2288 + + 0.7288 5 -15.914 56.8 21.37 0.000
## 24 2.4500 + + + 0.7750 6 -9.670 59.3 23.88 0.000
## 31 2.2490 + + 0.03304 0.8172 6 -9.754 59.5 24.05 0.000
## 30 2.6440 + + 0.06719 0.9005 6 -9.782 59.6 24.10 0.000
## 16 3.2740 + + + -0.06289 6 -11.994 64.0 28.53 0.000
## 28 0.8763 + + -0.21810 0.4960 6 -15.235 70.5 35.01 0.000
## 32 2.3160 + + + 0.05396 0.8350 7 -9.567 89.1 53.67 0.000
## Models ranked by AICc(x)
# Check the global model for multicollinearity
performance::check_collinearity(all.params)  # low correlation detected - good to go!
## # Check for Multicollinearity
##
## Low Correlation
##
## Term VIF VIF 95% CI Increased SE Tolerance Tolerance 95% CI
## no.bread 1.52 [1.27, 1.99] 1.23 0.66 [0.50, 0.79]
## app 1.29 [1.12, 1.70] 1.13 0.78 [0.59, 0.90]
## no.drink 1.27 [1.10, 1.68] 1.13 0.79 [0.59, 0.91]
## hunger 1.15 [1.03, 1.62] 1.07 0.87 [0.62, 0.97]
## leftovers 1.57 [1.31, 2.06] 1.25 0.64 [0.49, 0.76]
Like I said, the output is kind of hefty. Let’s first subset these results to only include models within five units of the top model (delta < 5) to filter out most of the less-relevant models.
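Assuming the dredge results were saved in an object, say `out.put.all` (name hypothetical), the subsetting is a one-liner:

```r
# Keep only models within 5 AICc units of the top-ranked model
subset(out.put.all, delta < 5)
```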
## Global model call: lm(formula = no.pasta ~ no.bread + app + no.drink + hunger +
## leftovers, data = pasta)
## ---
## Model selection table
## (Int) hng lft no.drn df logLik AICc delta weight
## 5 4.000 + 3 -12.730 35.5 0.00 0.546
## 21 2.750 + 0.8333 4 -10.020 36.0 0.58 0.408
## 7 3.071 + + 4 -12.198 40.4 4.94 0.046
## Models ranked by AICc(x)
Now, let’s look at the top model only. The top model has a delta value of 0 by definition, since delta values are measured relative to it.
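Again assuming the dredge object is called `out.put.all`, there are a couple of ways to pull out just the winner:

```r
# Show only the top-ranked model's row in the selection table
subset(out.put.all, delta == 0)

# Or refit it as a plain lm object for further inspection
best.mod <- get.models(out.put.all, subset = 1)[[1]]
summary(best.mod)
```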
## Global model call: lm(formula = no.pasta ~ no.bread + app + no.drink + hunger +
## leftovers, data = pasta)
## ---
## Model selection table
## (Int) lft df logLik AICc delta weight
## 5 4 + 3 -12.73 35.5 0 1
## Models ranked by AICc(x)
Model 5 has been selected as the best-fitting model for our data. Apparently, whether or not someone took leftovers home was a highly relevant variable in this set. Model 5 has an AICc value of 35.5, and it is comparable to the next-best model, model 21, whose AICc value was 36.0. Model 21 included the number of drinks someone had as well as whether or not they took home leftovers.
The delta values for the top two models were 0.00 and 0.58, and their weights were 0.546 and 0.408, respectively. Clearly, these models have similar explanatory power when it comes to predicting the number of bowls of pasta someone ate.
Let’s see which variables are the most important for this data set. We can find importance weights for individual predictor variables with the sw (sum of weights) function from MuMIn, which adds up the weights of every model each variable appears in.
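Assuming the same hypothetical `out.put.all` dredge object as above, the call is just:

```r
# Sum the Akaike weights of every dredged model containing each predictor
sw(out.put.all)
```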
## leftovers no.drink hunger no.bread app
## Sum of weights: 0.92 0.38 0.06 0.04 0.04
## N containing models: 16 16 16 16 16
I can’t believe the number of breadsticks eaten is ranked lower than the number of drinks someone had! This surprises me, personally.
I guess it’s time to include a figure in this assignment. There is no escaping ggplot!
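The plotting code isn’t shown, so here is a sketch of how `p1` and `p2` might have been built, assuming ggplot2 for the panels and the patchwork package for the `p1 / p2` stacking syntax. The variable pairings are my guesses based on the top predictors above:

```r
library(ggplot2)
library(patchwork)

# Pasta bowls vs. drinks consumed, with a linear smoother
p1 <- ggplot(pasta, aes(x = no.drink, y = no.pasta)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Drinks consumed", y = "Pasta bowls eaten")

# Pasta bowls vs. breadsticks eaten
p2 <- ggplot(pasta, aes(x = no.bread, y = no.pasta)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Breadsticks eaten", y = "Pasta bowls eaten")
```

The `p1 / p2` line below then stacks the two panels vertically; patchwork overloads `/` for exactly that.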
p1 / p2
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Since breadstick eating didn’t seem to be ranked as the best predictor variable for number of pasta bowls eaten… maybe we should try with chips and salsa. Who’s with me?